Modeling Prosody for Speaker Recognition: Why Estimating Pitch May Be a Red Herring

Authors

  • Kornel Laskowski
  • Qin Jin
Abstract

It has long been claimed that spectral envelope features outperform prosodic features on speaker recognition tasks. However, the reasons for such an arrangement are not entirely compelling. In the current work we present some evidence to challenge these claims. We propose that energy found at harmonically related frequencies encodes the acoustic correlates of variables which are typically referred to as prosodic, making harmonic energy summation highly relevant. Its frequent implementation for estimating pitch appears to have gone unnoticed by the speaker recognition community, because pitch estimators quite deliberately discard what they compute, retaining only the abscissa of a maximum. We argue that this latter step renders pitch estimation somewhat ill-suited to speaker recognition tasks. We present the detailed construction of a discrete transform, and a normalization, which are amenable to relatively laconic modeling. With this framework we achieve or exceed the performance of spectral envelope features in nearfield, matched-channel and matched-multisession conditions; performance improves following envelope destruction. We believe these results may have far-reaching consequences. For speech processing in a multitude of applications, they suggest that modeling the harmonic structure in the way we propose is at least as relevant as is modeling other aspects of the signal.
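To make the abstract's central observation concrete, the following is a minimal sketch of generic harmonic energy summation. It is not the discrete transform or normalization that the authors construct; all function names, the synthetic test frame, and the candidate grid are assumptions made for illustration. It shows that a summation-based pitch estimator computes a full profile over candidate fundamentals and then keeps only the location of the maximum.

```python
import math

def harmonic_energy(frame, sample_rate, f0, n_harmonics=5):
    """Sum spectral energy at integer multiples of a candidate f0.

    Each harmonic's energy is measured by projecting the frame onto a
    cosine/sine pair at that frequency (a single-bin DFT).
    """
    n = len(frame)
    total = 0.0
    for h in range(1, n_harmonics + 1):
        freq = h * f0
        if freq >= sample_rate / 2:  # stay below the Nyquist frequency
            break
        w = 2.0 * math.pi * freq / sample_rate
        re = sum(x * math.cos(w * t) for t, x in enumerate(frame))
        im = sum(x * math.sin(w * t) for t, x in enumerate(frame))
        total += (re * re + im * im) / (n * n)
    return total

def harmonic_summation_profile(frame, sample_rate, candidates, n_harmonics=5):
    """Harmonic energy for every candidate f0: the quantity the abstract
    argues is worth keeping."""
    return [harmonic_energy(frame, sample_rate, f0, n_harmonics)
            for f0 in candidates]

def pitch_from_profile(profile, candidates):
    """What a pitch estimator retains: only the abscissa of the maximum."""
    best = max(range(len(profile)), key=profile.__getitem__)
    return candidates[best]

# Hypothetical test frame: 50 ms of a 200 Hz harmonic complex at 8 kHz,
# with harmonic amplitudes falling off as 1/h (an assumption, not real speech).
sample_rate = 8000
true_f0 = 200.0
frame = [
    sum(math.sin(2.0 * math.pi * h * true_f0 * t / sample_rate) / h
        for h in range(1, 6))
    for t in range(400)
]

candidates = [float(f) for f in range(100, 401, 20)]  # 100, 120, ..., 400 Hz
profile = harmonic_summation_profile(frame, sample_rate, candidates)
estimate = pitch_from_profile(profile, candidates)

print("pitch estimate (Hz):", estimate)
for f0, energy in zip(candidates, profile):
    print(f"{f0:6.1f} Hz  {energy:.4f}")
```

In this sketch the scalar `estimate` is all that a conventional pitch tracker would pass on, while `profile` still carries structure at other candidates (for instance at half the true fundamental). The abstract's argument is that a representation of this second kind, suitably transformed and normalized, is the more useful input for speaker modeling.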

Similar resources

Automatic Building of Synthetic Voices from Audio Books

Current state-of-the-art text-to-speech systems produce intelligible speech but lack the prosody of natural utterances. Building better models of prosody involves development of prosodically rich speech databases. However, development of such speech databases requires a large amount of effort and time. An alternative is to exploit story style monologues (long speech files) in audio books. These...

Processing the prosody of oral presentations

Standard advice to people preparing to speak in public is to use a “lively” voice. A lively voice is described as one that varies in intonation, rhythm and loudness: qualities that can be analyzed using speech analysis software. This paper reports on a study analyzing pitch variation as a measure of speaker liveliness. A potential application of this approach for analysis would be for rehearsin...

Incorporating Prosodic with Acoustic information for ISCSLP’2006 Speaker Recognition Evaluation- Robust Cross-Channel Speaker Verification

In this paper, we present our speaker verification (SV) systems for the cross-channel text-independent and dependent speaker verification (TI-SV and TD-SV) tasks of ISCSLP’2006 speaker recognition evaluation (ISCSLP2006-SRE). To address the cross-channel issues and take advantage of the unique characteristics of Mandarin (i.e., tonal language), prosodic contours are modeled to assist the state-...

Direct Modeling of Prosody: An Overview of Applications in Automatic Speech Processing

We describe a “direct modeling” approach to using prosody in various speech technology tasks. The approach does not involve any hand-labeling or modeling of prosodic events such as pitch accents or boundary tones. Instead, prosodic features are extracted directly from the speech signal and from the output of an automatic speech recognizer. Machine learning techniques then determine a prosodic m...

A lognormal tied mixture model of pitch for prosody based speaker recognition

Statistics of pitch have recently been used in speaker recognition systems with good results. The success of such systems depends on robust and accurate computation of pitch statistics in the presence of pitch tracking errors. In this work, we develop a statistical model of pitch that allows unbiased estimation of pitch statistics from pitch tracks which are subject to doubling and/or halving. ...

Publication date: 2010